Recognition and Transliteration of Proper Nouns in Cross-Language Record Linkage by Constructing Transliterated Word Pairs

Authors

  • Yuting Song
  • Biligsaikhan Batjargal
  • Akira Maeda
Abstract

Proper nouns in metadata are representative features for linking identical records across data sources in different languages. To improve the recognition of proper nouns in metadata and obtain their transliterations, we propose a method for constructing bilingual transliterated word pairs, in which transliterated words in the target language are back-transliterated to their original words in the source language. The acquired transliterated word pairs are then used to recognize and transliterate proper nouns in metadata. We evaluated the proposed method on the task of cross-language record linkage between a Japanese database and an English database. Experimental results show that using the obtained transliterated word pairs can improve the effectiveness of cross-language record linkage.
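
The abstract does not describe the construction procedure in detail, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a tiny, hypothetical katakana-to-romaji table (`KANA_TO_ROMAJI`), uses plain string similarity from Python's `difflib` as a stand-in for a real back-transliteration model, and picks an arbitrary threshold of 0.8. All function names are illustrative.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately incomplete katakana-to-romaji table (assumption).
KANA_TO_ROMAJI = {
    "パ": "pa", "リ": "ri", "ロ": "ro", "ン": "n",
    "ド": "do", "モ": "mo", "ネ": "ne",
}


def romanize(katakana_word):
    """Map each katakana character to romaji; unknown characters pass through."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in katakana_word)


def similarity(a, b):
    """String similarity in [0, 1], used here in place of a trained model."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(katakana_words, english_vocab, threshold=0.8):
    """Pair each katakana word with its closest English candidate, if close enough."""
    pairs = {}
    for word in katakana_words:
        romaji = romanize(word)
        best = max(english_vocab, key=lambda e: similarity(romaji, e.lower()))
        if similarity(romaji, best.lower()) >= threshold:
            pairs[word] = best
    return pairs


def shared_proper_nouns(ja_tokens, en_tokens, pairs):
    """Return English proper nouns that a Japanese and an English record share."""
    en_set = {t.lower() for t in en_tokens}
    return [pairs[t] for t in ja_tokens if t in pairs and pairs[t].lower() in en_set]


pairs = build_pairs(["パリ", "ロンドン"], ["Paris", "London", "Monet"])
matches = shared_proper_nouns(["パリ", "の", "風景"], ["View", "of", "Paris"], pairs)
```

In this toy setup, the acquired pairs serve both purposes the abstract names: a token found in `pairs` is treated as a proper noun, and its paired English word is its transliteration, which can then be compared against the English record.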

Similar resources

An Approach to Build a Proper Noun Dictionary for Record Linkage across Humanities Databases in Different Languages

This paper proposes a method to build a proper noun dictionary containing pairs of proper nouns and their transliterated words, which can be used for record linkage across humanities databases in different languages. The method is based on the observation that the corresponding word of a proper noun in a target language is almost the same as the transliteration of that proper noun. In this paper, the p...

English-Arabic Transliteration

Proper nouns may be considered as the most important query words in information retrieval. If the two languages use the same alphabet, the same proper nouns can be found in either language. However, if the two languages use different alphabets, the names must be transliterated. Short vowels are not usually marked on the Arabic words in almost all Arabic documents (except very important document...

Incorporating Pronunciation Variation into Extraction of Transliterated-term Pairs from Web Corpora

A novel approach to automatically extracting transliterated-term pairs from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking pronunciation variation into account. Pronunciation variation is a phenomenon of pronunciation ambiguity that seriously affects the term transliteration and hence affects those results produced by transliteration processe...

Improving Cross-Language Information Retrieval by Transliteration Mining and Generation

The retrieval performance of Cross-Language Information Retrieval (CLIR) systems is a function of the coverage of the translation lexicon they use. Unfortunately, most translation lexicons do not provide good coverage of proper nouns and common nouns, which are often the most information-bearing terms in a query. As a consequence, many queries cannot be translated without a substantial loss of informa...

Transliteration Using a Network of Phoneme Chunks

In this paper, we present methods of transliteration and back-transliteration. In Korean technical documents and web documents, many English words and Japanese words are transliterated into Korean words. These transliterated words are usually technical terms and proper nouns, so it is hard to find them in a dictionary. Therefore an automatic transliteration system is needed. Previous transliter...

Publication date: 2018